Amazon Onboarding with Learning Manager Chanci Turner

In machine learning (ML), data is the fuel that drives model effectiveness, and data quality directly affects model performance. Improving data quality and applying appropriate feature engineering techniques are essential to building accurate ML models. ML practitioners often find themselves in a tedious cycle of refining features, selecting algorithms, and tuning other aspects of the pipeline in pursuit of models that generalize well to real-world data. Because speed matters in business operations, this prolonged, iterative process can cause delays and missed opportunities.

Amazon SageMaker Data Wrangler streamlines the process of aggregating and preparing data for ML, reducing the timeline from weeks to mere minutes. Coupled with Amazon SageMaker Autopilot, which automatically constructs, trains, and fine-tunes the most effective ML models based on your data, these tools empower practitioners while ensuring they retain full control and visibility over their data and models. Both services are designed to enhance productivity and expedite the realization of value for ML practitioners.

Data Wrangler now offers a unified experience that lets you prepare data and train an ML model with Autopilot without leaving the tool. With this newly launched feature, you can prepare your data in Data Wrangler and start Autopilot experiments directly from the Data Wrangler user interface (UI). With a few clicks, you can automatically build, train, and tune ML models. This makes it easier to apply advanced feature engineering techniques, build high-quality ML models, and derive insights from your data more quickly.

In this article, we will explore how to leverage this integrated experience within Data Wrangler for analyzing datasets and efficiently constructing high-quality ML models in Autopilot.

Dataset Overview

The Pima Indians, an Indigenous community residing in Mexico and Arizona, have been identified as a high-risk group for diabetes mellitus. Predicting an individual’s likelihood of developing a chronic illness such as diabetes is crucial for improving the health and well-being of this often underrepresented group.

In this post, we use the publicly available Pima Indian Diabetes dataset to assess an individual’s risk of diabetes. Our focus is on the integration between Data Wrangler and Autopilot, which lets us prepare data and automatically generate an ML model without writing any code.

The dataset comprises information on Pima Indian females aged 21 and older and includes several medical predictor variables along with one target variable, Outcome. The following table describes the columns in our dataset.

Column Name	Description
Pregnancies	The number of times pregnant
Glucose	Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure	Diastolic blood pressure (mm Hg)
SkinThickness	Triceps skin fold thickness (mm)
Insulin	2-hour serum insulin (mu U/ml)
BMI	Body mass index (weight in kg/(height in m)²)
DiabetesPedigree	Diabetes pedigree function
Age	Age in years
Outcome	The target variable

The dataset consists of 768 records with 9 columns in total: 8 predictor features and the target variable. We store this dataset as a CSV file in an Amazon Simple Storage Service (Amazon S3) bucket and then import it directly into a Data Wrangler flow from Amazon S3.
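Before uploading the CSV file, it can help to confirm that its header matches the columns in the table above. The following is a minimal sketch using only the Python standard library; the function name `check_schema` and the two sample rows are hypothetical and serve only as an illustration, not as part of Data Wrangler.

```python
import csv
import io

# Column names copied from the table above.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigree", "Age", "Outcome",
]


def check_schema(csv_text, expected=EXPECTED_COLUMNS):
    """Return (row_count, problems) for CSV content passed as a string."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return 0, ["file is empty"]

    problems = []
    missing = [name for name in expected if name not in header]
    unexpected = [name for name in header if name not in expected]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if unexpected:
        problems.append("unexpected columns: " + ", ".join(unexpected))

    row_count = 0
    for line_no, row in enumerate(reader, start=2):
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        row_count += 1
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
    return row_count, problems


# Two illustrative rows (hypothetical sample values).
SAMPLE_CSV = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigree,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

row_count, problems = check_schema(SAMPLE_CSV)
print(row_count, problems)
```

For the full dataset, you would read the file contents into a string and expect a row count of 768 with an empty list of problems.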

Solution Overview

The diagram below summarizes our objectives for this discussion:

Data scientists and medical professionals provide patient information such as glucose levels, blood pressure, and body mass index, which is used to estimate the probability of diabetes. With the dataset stored in Amazon S3, we import it into Data Wrangler for exploratory data analysis (EDA) and data profiling. We then split the dataset into training and test sets for model development and evaluation, and perform feature engineering on the training data.

We then use the new Autopilot integration to build a model directly from the Data Wrangler interface. The best model is the one with the highest F-beta score. After Autopilot identifies the best model, we run a SageMaker Batch Transform job on the test set with the model artifacts for evaluation.
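For reference, the F-beta score combines precision and recall, with beta controlling how much more weight recall receives than precision. The following sketch computes it from confusion-matrix counts. The function name and the example counts are hypothetical, and the value of beta that Autopilot applies is not stated here, so it is left as a parameter.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=1.0):
    """F-beta score computed from confusion-matrix counts.

    beta > 1 weights recall more heavily; beta < 1 favors precision.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if true_positives == 0:
        # Precision and recall are both zero (or undefined), so report 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts: precision = 0.75, recall = 0.5.
f1 = f_beta(6, 2, 6, beta=1.0)
f2 = f_beta(6, 2, 6, beta=2.0)
print(round(f1, 4), round(f2, 4))
```

Because recall is lower than precision in this example, the recall-weighted F2 score comes out lower than F1. In a medical screening setting, where missing a true case is costly, a beta greater than 1 is a common choice.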

Medical professionals can input new data into the validated model to obtain predictions regarding a patient’s likelihood of developing diabetes. Such insights enable early intervention, improving the health of vulnerable populations. Additionally, model predictions can be explained through the details provided in Autopilot, ensuring that medical experts have complete visibility into the model’s explainability, performance, and artifacts. This transparency, combined with validation from the test set, enhances confidence in the model’s predictive capabilities.

We will guide you through the following high-level steps:

Import the dataset from Amazon S3.
Conduct EDA and data profiling with Data Wrangler.
Split the data into training and testing sets.
Perform feature engineering to address outliers and missing values.
Train and build a model using Autopilot.
Test the model on a holdout sample utilizing a SageMaker notebook.
Analyze performance on validation and test sets.
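In this walkthrough the train/test split is performed in the Data Wrangler UI, but the idea behind it can be sketched in a few lines of Python. The example below is an illustration only: the 80/20 ratio, the fixed seed, the stratification by Outcome, and the function name `stratified_split` are all assumptions, not settings taken from Data Wrangler.

```python
import random


def stratified_split(rows, label_key="Outcome", test_fraction=0.2, seed=42):
    """Split rows into (train, test), preserving the label ratio in each part."""
    rng = random.Random(seed)  # local generator keeps the split reproducible

    # Group rows by label so each class is split separately.
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical records: 10 positive and 20 negative cases.
records = [
    {"Glucose": 100 + i, "Outcome": 1 if i % 3 == 0 else 0}
    for i in range(30)
]

train_rows, test_rows = stratified_split(records)
print(len(train_rows), len(test_rows))
```

Keeping the test rows out of feature engineering and training is what makes the later evaluation on the holdout sample meaningful.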

Prerequisites

To proceed, complete the following prerequisites:

Upload the dataset to an S3 bucket of your choice.
Ensure you have the necessary permissions. For more details, visit Get Started with Data Wrangler.
Set up a SageMaker domain configured to utilize Data Wrangler. For guidance, refer to Onboard to Amazon SageMaker Domain.

Importing Your Dataset with Data Wrangler

Integrating a Data Wrangler data flow into your ML workflows can simplify and streamline data preprocessing and feature engineering with minimal coding. Follow these steps:

Create a new Data Wrangler flow. If this is your first time accessing Data Wrangler, you may need to wait a few minutes for it to initialize.
Select the dataset stored in Amazon S3 and import it into Data Wrangler. Upon import, a data flow diagram will appear within the Data Wrangler UI.
Click the plus sign next to Data types and select Edit to verify that Data Wrangler has correctly inferred the data types for your columns.

If the inferred data types are incorrect, you can easily modify them through the UI. If multiple data sources are present, you can join or concatenate them.
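To make the idea of type inference concrete, the following sketch guesses a column type from raw CSV strings by trying progressively looser parsers. It is a simplified illustration and not Data Wrangler's actual inference logic; the function name and the type labels ("integer", "float", "string") are hypothetical.

```python
def infer_type(values):
    """Guess 'integer', 'float', or 'string' for a list of raw CSV strings."""
    def all_parse(cast):
        # True only if every value can be converted with the given parser.
        try:
            for value in values:
                cast(value)
        except ValueError:
            return False
        return True

    if not values:
        return "string"  # nothing to inspect, so fall back to the loosest type
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


# Hypothetical column samples.
print(infer_type(["6", "1", "8"]))      # whole numbers, e.g. Pregnancies
print(infer_type(["33.6", "26.6"]))     # decimals, e.g. BMI
print(infer_type(["33.6", "n/a"]))      # mixed content
```

A single unexpected token in a numeric column is enough to push a naive inference to a string type, which is why checking the inferred types after import is worthwhile.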

Next, we can create an analysis and add transformations.

Performing Exploratory Data Analysis with the Data Insights Report

Exploratory data analysis is a vital component of the ML workflow. We can utilize the new data insights report from Data Wrangler for a better understanding of our data’s profile and distribution. The report includes summary statistics and data quality warnings.
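As a rough picture of what summary statistics and data quality warnings look like for a single column, consider the following sketch built on Python's `statistics` module. The function name, the specific statistics, and the two warning rules are assumptions chosen for illustration; the actual report generated by Data Wrangler is produced in the UI and may differ.

```python
import statistics


def profile_column(name, values):
    """Return (summary, warnings) for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    summary = {"column": name, "count": len(present), "missing": missing}
    warnings = []

    if present:
        summary.update(
            mean=statistics.mean(present),
            median=statistics.median(present),
            minimum=min(present),
            maximum=max(present),
        )
        if len(set(present)) == 1:
            warnings.append(f"{name}: constant column")
    if missing:
        warnings.append(f"{name}: {missing} missing value(s)")
    return summary, warnings


# Hypothetical glucose readings with one missing entry.
summary, warnings = profile_column("Glucose", [148, 85, None, 183])
print(summary)
print(warnings)
```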

This integrated approach not only simplifies the process but also gives medical professionals the tools they need to make data-driven decisions with confidence.
